ABSTRACT


Let $(x_1,\ldots,x_k,y)$ be a random variable where $y$ denotes a response on the
vector $x$ of predictor variables. In this paper we propose a technique (termed
ADE) for studying the mean response $m(x)=E(y|x)$ through the estimation of the
k-vector of average derivatives $\delta=E(m')$. The ADE procedure involves two
stages: first estimate $\delta$ using an estimator $\hat\delta$, and then estimate
$m(x)$ as $\hat m(x)=\hat g(x^T\hat\delta)$, where $\hat g$ is an estimator of the
univariate regression of $y$ on $x^T\hat\delta$. We argue that the ADE procedure
exhibits several attractive characteristics: data summarization through
interpretable coefficients, graphical depiction of the possible nonlinearity
between $y$ and $x^T\hat\delta$, and theoretical properties consistent with
dimension reduction. We motivate the ADE procedure using examples of models
that take the form $m(x)=\tilde g(x^T\beta)$. In this framework, $\delta$ is shown
to be proportional to $\beta$, and $\hat m(x)$ infers $m(x)$ exactly.

The focus of the procedure is on the estimator $\hat\delta$, which is based on a
simple average of kernel smoothers, and is shown to be a $\sqrt{N}$ consistent and
asymptotically normal estimator of $\delta$. The estimator $\hat g(\cdot)$ is a
standard kernel regression estimator, and is shown to have the same properties
as the kernel regression of $y$ on $x^T\delta$. In sum, the estimator $\hat\delta$
converges to $\delta$ at the rate typically available in parametric estimation
problems, and $\hat m(x)$ converges at the optimal one-dimensional nonparametric
rate.

We study the ADE estimators using Monte Carlo analysis, using sample
designs with k=4 predictor variables. The ADE estimators perform well in
samples of size N=50 generated from a linear model, and samples of size N=100
generated by a highly nonlinear model. For the latter samples, the ADE
procedure is seen to have desirable goodness-of-fit and data summarization
features relative to a multivariate regression smoother.


INVESTIGATING SMOOTH MULTIPLE REGRESSION BY THE METHOD OF AVERAGE DERIVATIVES
by Wolfgang Härdle and Thomas M. Stoker

1. Introduction

The popularity of linear modeling in empirical analysis is based, in
large part, on the ease with which the results can be interpreted. This
tradition influenced the modeling of various parametric nonlinear regression
relationships, where the mean response variable is assumed to be a nonlinear
function of a weighted sum of the predictor variables. As in linear modeling,
this feature is attractive because the coefficients, or weights of the sum,
give a simple picture of the relative impacts of the individual predictor
variables on the response variable. In this paper we propose a flexible method
of studying general multivariate regression relationships in line with this
approach. Our method is to first estimate a specific set of coefficients,
termed average derivatives, and then compute a (univariate) nonparametric
regression of the response on the weighted sum of predictor variables.

The central focus of this paper is analysis of the average derivative,
which is defined as follows. Let $(x,y)=(x_1,\ldots,x_k,y)$ denote a random
variable, where $y$ is the response studied. If the mean response of $y$ given
$x$ is denoted as

(1.1)      $m(x) = E(y|x)$

then the vector of "average derivatives" is given as

(1.2)      $\delta = E(m')$

where $m' = \partial m/\partial x$ is the vector of partial derivatives, and
expectation is taken with respect to the marginal distribution of $x$. We argue
in the next section that $\delta$ represents sensible "coefficients" of changes
in $x$ on $y$.

We construct a nonparametric estimator $\hat\delta$ of $\delta$, based on an
observed random sample $(x_i,y_i)$, $i=1,\ldots,N$. Our procedure for modeling
$m(x)$ is to first compute $\hat\delta$, form the weighted sum
$\hat z_i = x_i^T\hat\delta$ for $i=1,\ldots,N$ (where $x^T$ is the transpose of
$x$), and then compute the (Nadaraya-Watson) kernel estimator $\hat g(\cdot)$ of
the regression of $y_i$ on $\hat z_i$. The regression function $m(x)$ is then
modeled as

(1.3)      $\hat m(x) = \hat g(x^T\hat\delta)$

The output of the procedure is three-fold: a summary of the relative impacts
of changes in $x$ on $y$ (via $\hat\delta$), a graphical depiction of the possible
nonlinearity between $y$ and the weighted sum $x^T\hat\delta$ (a graph of
$\hat g$), and a formula for computing estimates of the mean response $m(x)$
(from equation (1.3)). We refer to this as the ADE method, for "average
derivative estimation."

The exposition is designed to show that the ADE method has three
attractive features: data summarization through interpretable coefficients,
computational simplicity and theoretical properties consistent with dimension
reduction. The statistic $\hat\delta$ is based on a simple average of
nonparametric kernel smoothers, and its properties depend only on regularity
properties of the joint density of $(x,y)$; in particular, they require no
functional form assumptions on the regression function $m(x)$. The limiting
distribution of $\sqrt{N}(\hat\delta-\delta)$ is multivariate normal. The
nonparametric regression estimator $\hat m(x)=\hat g(x^T\hat\delta)$ is
constructed from a k-dimensional predictor variable, but it achieves the
optimal rate $N^{2/5}$ that is typical for one-dimensional smoothing problems
(see Stone (1980)). While $\hat\delta$ and $\hat g(\cdot)$ each involve choice of a
smoothing parameter, they are computed directly from the data in two steps,
and thus require no computer intensive iterative techniques for finding
optimal objective function values.

Section 2 motivates the ADE method through several applied examples.
Section 3 introduces the estimators $\hat\delta$ and $\hat g$ and establishes
their large sample statistical properties. Section 4 discusses the results,
including the relationship of the ADE method to projection pursuit regression
(PPR) of Friedman and Stuetzle (1981) and other flexible methods. The Monte
Carlo study of Section 5 shows the ADE method to be well behaved in finite
samples. Section 6 follows with some concluding remarks.

2.  Motivation  of  the  ADE  Procedure 


The average derivative $\delta$ is most naturally interpreted in situations
where the influence of $x$ on $y$ is modeled via a weighted sum $x^T\beta$, where
$\beta$ is a vector of coefficients; that is, where the regression function is
$m(x)=\tilde g(x^T\beta)$. In such a model, there is an intimate relationship
between the coefficients $\beta$ and the average derivative $\delta$. In
particular, $m' = [d\tilde g/d(x^T\beta)]\,\beta$, so that
$\delta = E[d\tilde g/d(x^T\beta)]\,\beta = \gamma\beta$, where $\gamma$ is a scalar
(assumed nonzero). Consequently, when the mean response is a function of a
weighted sum $x^T\beta$, the average derivative $\delta$ is a constant multiple
of the coefficient vector $\beta$.
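The proportionality between the average derivative and the coefficient vector can be checked numerically. The following is a minimal sketch, not part of the original paper: it assumes a logistic mean response, standard normal predictors, and an illustrative coefficient vector, and it approximates the derivative of m(x) by central differences. The function names are hypothetical.

```python
import numpy as np

def logistic_mean(x, beta, alpha=0.0):
    """Mean response m(x) = exp(a + x'b) / [1 + exp(a + x'b)]."""
    return 1.0 / (1.0 + np.exp(-(alpha + x @ beta)))

def average_derivative(mean_fn, x, eps=1e-4):
    """Sample average of central-difference gradients of mean_fn over rows of x."""
    n, k = x.shape
    delta = np.empty(k)
    for j in range(k):
        step = np.zeros(k)
        step[j] = eps
        delta[j] = np.mean((mean_fn(x + step) - mean_fn(x - step)) / (2.0 * eps))
    return delta

rng = np.random.default_rng(0)            # illustrative design (assumption)
beta = np.array([1.0, -2.0, 0.5])         # illustrative coefficients (assumption)
x = rng.standard_normal((20_000, 3))
delta = average_derivative(lambda v: logistic_mean(v, beta), x)
gamma = delta / beta                      # every component should be the same scalar
```

Each entry of `gamma` approximates the same scalar, the sample average of the derivative of the logistic function along the index, so `delta` is a rescaled copy of `beta`.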

An obvious example is the classical linear regression model:
$y=\alpha+x^T\beta+e$, where $e$ is a random variable uncorrelated with $x$,
which gives $\delta=\beta$. Another class of models is those that are linear up
to transformations:

(2.1)      $\phi(y) = \psi(x^T\beta) + e$

where $\psi(\cdot)$ is a non-constant transformation, $\phi(\cdot)$ is invertible,
and $e$ is a random disturbance that is independent of $x$. Here we have that
$m(x)=E[\phi^{-1}(\psi(x^T\beta)+e)\,|\,x]=\tilde g(x^T\beta)$. The form (2.1)
includes the model of Box-Cox (1964), where
$\phi(y) = (y^{\lambda_1}-1)/\lambda_1$ and
$\psi(x^T\beta) = \alpha + [(x^T\beta)^{\lambda_2}-1]/\lambda_2$.

Other models exhibiting this structure are discrete regression models,
where $y$ is 1 or 0 according to

(2.2)      $y = 1$   if $e < \psi(x^T\beta)$
             $= 0$   if $e \geq \psi(x^T\beta)$

Here the regression function $m(x)$ is the probability that $y=1$, which is
given as $m(x)=\mathrm{Prob}\{e<\psi(x^T\beta)\,|\,x\}=\tilde g(x^T\beta)$.
References to specific examples of binary response models can be found in
Manski and McFadden (1981). Standard probit models are included by assuming
that $\psi(x^T\beta)=\alpha+x^T\beta$ and $e$ is a normal random variable (with
distribution function $\Phi$), giving $m(x)=\Phi(\alpha+x^T\beta)$. Logistic
regression models are likewise included; here
$m(x)=\exp(\alpha+x^T\beta)/[1+\exp(\alpha+x^T\beta)]$.
Censored regression, where

(2.3)      $y = \psi(x^T\beta) + e$   if $\psi(x^T\beta) + e \geq 0$
             $= 0$                     if $\psi(x^T\beta) + e < 0$

is likewise included, and setting $\psi(x^T\beta)=\alpha+x^T\beta$ gives the
familiar censored linear regression model (see Powell (1986) among others).
For further examples from econometric modeling, see Stoker (1986).

A parametric approach to the estimation of any of these models, for
instance based on maximum likelihood, requires the (parametric) specification
of the distribution of the random variable $e$ and of the transformation
$\psi(\cdot)$, and for (2.1), the transformation $\phi(\cdot)$. Substantial bias
can result if any of these features is incorrectly specified. Nonparametric
estimation of $\delta=\gamma\beta$ avoids such restrictive specifications. In
fact, the form $m(x)=\tilde g(x^T\beta)$ generalizes the "generalized linear
models (GLIM)"; see McCullagh and Nelder (1983). These models have $\tilde g$
invertible with $\tilde g^{-1}$ referred to as the "link" function. Other
approaches that generalize GLIM models can be found in Breiman and
Friedman (1985) and Hastie and Tibshirani (1986).

Turning our attention to ADE regression modeling, we show in the next
section that $\hat m(x)$ of (1.3) will estimate
$g(x^T\delta)=E(y\,|\,x^T\delta)$, in general. Consequently, the ADE method will
completely infer $m(x)$ when

(2.4)      $m(x) = g(x^T\delta)$

But it is easy to see that (2.4) will always be valid when $m(x)$ takes the
form $m(x)=\tilde g(x^T\beta)$, as in the above examples. Suppose that such a
model were reparameterized to have coefficients $\beta^*=c\beta$, where $c$ is a
nonzero scalar; then we can write $m(x)=\tilde g^*(x^T\beta^*)$, with
$\tilde g^*$ defined as $\tilde g^*(\cdot)=\tilde g(\cdot/c)$. Thus when
$m(x)=\tilde g(x^T\beta)$, we can equivalently reparameterize $m(x)$ to have
coefficients $\delta=\gamma\beta$, giving (2.4) with
$g(\cdot)=\tilde g(\cdot/\gamma)$. This corresponds to a normalization exhibited
by the function $g(\cdot)$ of (2.4), namely $E[dg/d(x^T\delta)]=1$. To understand
this normalization, consider a reparameterization of $\beta$ to
$\beta^*=c\beta$, and note that $\delta=\gamma\beta=\gamma^*\beta^*$, where
$\gamma^* = \gamma/c = E[d\tilde g^*/d(x^T\beta^*)]$, with $\tilde g^*$ defined
as above. Setting $c=\gamma$ gives $\beta^*=\gamma\beta=\delta$, and we conclude
that $E[dg/d(x^T\delta)] = \gamma^* = \gamma/\gamma = 1$.

Another way of interpreting the scale of $\delta$ is to consider the change in
the mean of $y$ when $x$ is translated to $x+\Delta x$ in the transformation
model (2.1). The average change is proportional to $(\Delta x)^T\delta$, as it
is in the linear model. Other scalings of the coefficients $\beta$ make this
change dependent on $\phi(\cdot)$ and $\psi(\cdot)$.


3. Kernel Estimation of Average Derivatives

Our approach to estimation of $\delta$ utilizes nonparametric estimation of the
marginal density of $x$. Let $f(x)$ denote this marginal density,
$f' \equiv \partial f/\partial x$ the vector of partial derivatives, and
$\ell = -\partial \ln f/\partial x = -f'/f$ the negative log-density derivative.
If $f(x)=0$ on the boundary of $x$ values, then integration by parts gives

(3.1)      $\delta = E(m') = E(\ell y)$

Our estimator of $\delta$ is a sample analogue of the last term in this formula,
using a nonparametric estimator of $\ell(x)$ evaluated at each observed value
$x_i$, $i=1,\ldots,N$.
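The integration-by-parts step behind (3.1) can be written out as follows. This is a short worked derivation using only the definitions above; the boundary term vanishes because $f(x)=0$ on the boundary of $x$ values.

```latex
\begin{aligned}
\delta = E(m')
  &= \int m'(x)\, f(x)\, dx \\
  &= -\int m(x)\, f'(x)\, dx
     && \text{(integration by parts; boundary term is zero)} \\
  &= \int m(x)\left[-\frac{f'(x)}{f(x)}\right] f(x)\, dx
   = E[\ell(x)\, m(x)] \\
  &= E[\ell(x)\, y]
     && \text{(iterated expectations, since } m(x)=E(y|x)\text{)}
\end{aligned}
```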

In particular, the density function $f(x)$ is estimated at $x$ using the
(Rosenblatt-Parzen) kernel density estimator:

(3.2)      $\hat f_h(x) = N^{-1} h^{-k} \sum_{j=1}^{N} K\!\left(\frac{x - x_j}{h}\right)$

where $K(\cdot)$ is a kernel function, $h=h_N$ is the bandwidth parameter, and
$h\to 0$ as $N\to\infty$. The vector function $\ell(x)$ is then estimated at $x$
using $\hat f_h(x)$ as

(3.3)      $\hat\ell_h(x) = -\hat f_h'(x)/\hat f_h(x)$

where $\hat f_h' = \partial\hat f_h/\partial x$ is an estimator of the partial
density derivative. For a suitable kernel $K(\cdot)$, under general conditions
$\hat f_h(x)$, $\hat f_h'(x)$ and $\hat\ell_h(x)$ are consistent estimators of
$f(x)$, $f'(x)$ and $\ell(x)$, respectively.

Because of division by $\hat f_h(x)$, the function $\hat\ell_h(x)$ may exhibit
erratic behavior when the value of $\hat f_h$ is very small. Consequently, for
estimation of $\delta$ we only include terms for which the value of
$\hat f_h(x_i)$ is above a bound. Toward this end, define the indicator
$\hat I_i = I[\hat f_h(x_i)>b]$, where $I[\cdot]$ is the indicator function, and
where $b=b_N$ is a trimming bound such that $b\to 0$ as $N\to\infty$.
The "average derivative estimator" $\hat\delta$ is defined as:

(3.4)      $\hat\delta = N^{-1} \sum_{i=1}^{N} \hat\ell_h(x_i)\, y_i\, \hat I_i$

We derive the large sample statistical properties of $\hat\delta$ on the basis
of smoothness conditions on $m(x)$ and $f(x)$. The required assumptions (given
formally in the Appendix) are described as follows. As above, the k-vector $x$
is continuously distributed with density $f(x)$, and $f(x)=0$ on the boundary
of $x$ values. The regression function $m(x)=E(y|x)$ is (a.e.) continuously
differentiable, and the second moments of $m'$ and $\ell y$ exist. The density
$f(x)$ is assumed to be smooth, having partial derivatives of order $p>k+2$.
The kernel function $K(\cdot)$ has compact support and is assumed to be of order
$p$. Finally, we assume some technical conditions on the behavior of $m(x)$ and
$f(x)$ in the tails of the distribution; for instance ruling out thick tails,
and rapid increases in $m(x)$ as $|x|\to\infty$.
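The estimator (3.4), built from (3.2) and (3.3) with trimming, can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: it uses a product biweight kernel (a positive kernel, not the order-p kernel required by the asymptotic theory), it evaluates the density at each $x_i$ including the observation itself, and the function names, the simulated design, and the values of `h` and `b` are hypothetical.

```python
import numpy as np

def biweight(u):
    """Univariate biweight kernel K1(u) = (15/16)(1 - u^2)^2 on |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)

def biweight_deriv(u):
    """Derivative K1'(u) = -(15/4) u (1 - u^2) on |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, -(15.0 / 4.0) * u * (1.0 - u**2), 0.0)

def average_derivative_estimate(x, y, h, b):
    """Return (delta_hat, f_hat, keep) for data x (N x k), y (N,)."""
    n, k = x.shape
    u = (x[:, None, :] - x[None, :, :]) / h          # u[i, j] = (x_i - x_j) / h
    k1 = biweight(u)
    dk1 = biweight_deriv(u)
    # (3.2): product-kernel density estimate at each x_i
    f_hat = k1.prod(axis=2).sum(axis=1) / (n * h**k)
    # gradient of the density estimate: differentiate one factor at a time
    f_grad = np.empty((n, k))
    for l in range(k):
        others = np.delete(k1, l, axis=2).prod(axis=2)
        f_grad[:, l] = (dk1[:, :, l] * others).sum(axis=1) / (n * h ** (k + 1))
    keep = f_hat > b                                 # trimming indicator
    ell_hat = np.zeros((n, k))
    ell_hat[keep] = -f_grad[keep] / f_hat[keep, None]  # (3.3)
    delta_hat = (ell_hat * y[:, None]).sum(axis=0) / n  # (3.4)
    return delta_hat, f_hat, keep

# Illustrative use on a simulated linear design (assumption, k = 2).
rng = np.random.default_rng(1)
x = rng.standard_normal((1000, 2))
y = x @ np.array([1.0, 2.0]) + 0.05 * rng.standard_normal(1000)
delta_hat, f_hat, keep = average_derivative_estimate(x, y, h=1.0, b=0.01)
```

The pairwise array `u` has size N x N x k, so this direct form is only practical for moderate sample sizes. With a positive kernel and a fixed bandwidth the estimate is shrunk somewhat toward zero, so only the relative sizes of the components should be read from such a sketch.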

Under these conditions, $\hat\delta$ is an asymptotically normal estimator of
$\delta$, stated formally as

Theorem 3.1: Given Assumptions 1.-9. stated in the Appendix, if

(i) $N\to\infty$, $h\to 0$, $b\to 0$, $b^{-1}h\to 0$,

(ii) for some $\varepsilon>0$, $b^4 N^{1-\varepsilon} h^{2k+2}\to\infty$,

(iii) $N h^{2p-2}\to 0$,

then $\sqrt{N}(\hat\delta-\delta)$ has a limiting normal distribution with mean 0
and variance $\Sigma$, where $\Sigma$ is the covariance matrix of $r(y,x)$, with

(3.5)      $r(y,x) = m'(x) + [y - m(x)]\,\ell(x)$

The proof of Theorem 3.1, as well as those of the other results of the paper,
are contained in the Appendix.

For the purpose of carrying out inference on the value of $\delta$, the
covariance matrix $\Sigma$ could be consistently estimated as the sample
variance of uniformly consistent estimators of $r(y_i,x_i)$, $i=1,\ldots,N$,
and the latter could be constructed using any uniformly consistent estimators
of $\ell(x)$, $m(x)$ and $m'(x)$. The proof of Theorem 3.1 suggests a more
direct estimator of $r(y_i,x_i)$, defined as

(3.6)      $\hat r_{hi} = \hat\ell_h(x_i)\, y_i\, \hat I_i + N^{-1} \sum_{j=1}^{N} \left[ h^{-k-1} K'\!\left(\frac{x_i - x_j}{h}\right) - h^{-k} K\!\left(\frac{x_i - x_j}{h}\right) \hat\ell_h(x_j) \right] \frac{y_j\, \hat I_j}{\hat f_h(x_j)}$

Define the estimator $\hat\Sigma$ of $\Sigma$ as

(3.7)      $\hat\Sigma = N^{-1} \sum_{i=1}^{N} \hat r_{hi}\, \hat r_{hi}^T\, \hat I_i - \hat\delta\hat\delta^T$


We then have


Theorem 3.2: If $N\to\infty$, $h\to 0$, $b\to 0$ and $b^{-1}h\to 0$, then
$\hat\Sigma$ is a consistent estimator of $\Sigma$.


Theorem 3.2 facilitates inference on hypotheses about $\delta$. For instance,
consider testing restrictions that certain components of $\delta$ are zero, or
equality restrictions across components of $\delta$. Such restrictions are
captured by the null hypothesis that $Q\delta=\delta_0$, where $Q$ is a
$k_1 \times k$ matrix of full rank $k_1<k$. Tests of this hypothesis can be based
on the Wald statistic
$W=N(Q\hat\delta-\delta_0)^T (Q\hat\Sigma Q^T)^{-1} (Q\hat\delta-\delta_0)$, which
has a limiting $\chi^2$ distribution with $k_1$ degrees of freedom.
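The Wald statistic above is a single quadratic form and can be computed directly once estimates are available. The sketch below is illustrative only: the function name and the numerical inputs are hypothetical, and the inputs stand in for an estimated coefficient vector and covariance matrix.

```python
import numpy as np

def wald_statistic(delta_hat, sigma_hat, q, delta0, n):
    """W = N (Q d - d0)' (Q S Q')^{-1} (Q d - d0) for the hypothesis Q delta = delta0."""
    diff = q @ delta_hat - delta0
    middle = q @ sigma_hat @ q.T
    # solve() avoids forming the explicit inverse of Q S Q'
    return float(n * diff @ np.linalg.solve(middle, diff))

# Hypothetical inputs: test that the first of two coefficients is zero.
w = wald_statistic(
    delta_hat=np.array([1.0, 2.0]),
    sigma_hat=np.eye(2),
    q=np.array([[1.0, 0.0]]),
    delta0=np.array([0.0]),
    n=10,
)
```

With these inputs the quadratic form reduces to N times the squared first coefficient, and the result would be compared with a chi-squared critical value with one degree of freedom.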

We now turn our attention to the estimation of
$g(x^T\delta)=E(y\,|\,x^T\delta)$, and add the assumption that $g(\cdot)$ is twice
differentiable. Set $\hat z_j=x_j^T\hat\delta$, $j=1,\ldots,N$, and let $f_1$
denote the density of $z=x^T\delta$. Define $\hat g_{h'}(z)$ as the
(Nadaraya-Watson) kernel estimator of the regression of $y$ on
$\hat z=x^T\hat\delta$:

(3.8)      $\hat g_{h'}(z) = \frac{N^{-1} h'^{-1} \sum_{j=1}^{N} K_1\!\left(\frac{z - \hat z_j}{h'}\right) y_j}{\hat f_{1h'}(z)}$

where $\hat f_{1h'}$ is the density estimator:

(3.9)      $\hat f_{1h'}(z) = N^{-1} h'^{-1} \sum_{j=1}^{N} K_1\!\left(\frac{z - \hat z_j}{h'}\right)$

with bandwidth $h'=h'_N$, and where $K_1$ is a symmetric (positive univariate)
kernel function. Suppose, for a moment, that $z_j=x_j^T\delta$ instead of
$\hat z_j$ were used in (3.8) and (3.9); then it is well known
(Schuster (1972)) that the resulting regression estimator is asymptotically
normal and converges (pointwise) at the optimal (univariate) rate $N^{2/5}$.
Theorem 3.3 states that there is no cost to using the estimated values
$\hat z_j$ as above:


Theorem 3.3: Suppose $z$ is such that $f_1(z)>b_1>0$. If $N\to\infty$,
$h'\sim N^{-1/5}$, then $N^{2/5}[\hat g_{h'}(z)-g(z)]$ has a limiting normal
distribution with mean $B(z)$ and variance $V(z)$, where

(3.10)     $B(z) = \frac{1}{2}\left[g''(z) + 2\,g'(z) f_1'(z)/f_1(z)\right] \int u^2 K_1(u)\, du$
           $V(z) = \left[\mathrm{Var}(y\,|\,x^T\delta=z)/f_1(z)\right] \int K_1(u)^2\, du$

The bias and variance given in (3.10) can be estimated consistently for
each $z$ using $y$, $\hat g_{h'}$ and $\hat f_{1h'}$, and their derivatives, using
standard methods. Therefore, asymptotic confidence intervals can be
constructed for $\hat g_{h'}(z)$. It is clear that the same confidence intervals
apply to $\hat m(x)=\hat g_{h'}(x^T\hat\delta)$, for $z=x^T\hat\delta$.
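The second stage, the Nadaraya-Watson estimator (3.8) with density estimator (3.9), can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a biweight kernel for $K_1$, the function names are hypothetical, and the toy data stand in for the index values $\hat z_j = x_j^T\hat\delta$. The factors $N^{-1}h'^{-1}$ cancel between numerator and denominator, so they are omitted.

```python
import numpy as np

def biweight(u):
    """Univariate biweight kernel K1(u) = (15/16)(1 - u^2)^2 on |u| <= 1."""
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)

def nadaraya_watson(z_eval, z_obs, y, h):
    """Kernel regression of y on z_obs, evaluated at the points z_eval."""
    u = (np.asarray(z_eval, dtype=float)[:, None] - z_obs[None, :]) / h
    w = biweight(u)                      # w[m, j] = K1((z_m - z_j) / h)
    denom = w.sum(axis=1)                # proportional to the density estimate (3.9)
    numer = w @ y
    out = np.full(denom.shape, np.nan)   # undefined where no data fall in the window
    ok = denom > 0
    out[ok] = numer[ok] / denom[ok]
    return out

# Toy check (assumption): a constant response is reproduced exactly.
z_obs = np.linspace(-1.0, 1.0, 21)
y_obs = np.full(21, 3.0)
fit = nadaraya_watson(np.array([0.0, 0.5]), z_obs, y_obs, h=0.3)
```

In an application `z_obs` would be computed as `x @ delta_hat` and the fitted curve plotted against the index to display the possible nonlinearity.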


4. Remarks and Discussion


4.1 On The Average Derivative Estimator

As indicated in the Introduction, the most interesting feature of Theorem
3.1 is that $\hat\delta$ converges to $\delta$ at rate $\sqrt{N}$. This is the rate
typically available in parametric estimation problems, and is the rate that
would be attained if the values $\ell(x_i)$, $i=1,\ldots,N$ were known and used
in the average (3.4). The estimator $\hat\ell_h(x)$ converges pointwise to
$\ell(x)$ at a slower rate, so Theorem 3.1 gives a situation where the average
of nonparametric estimators converges more quickly than any of its individual
components. This occurs because of the overlap between kernel densities at
different evaluation points; for instance, if $x_i$ and $x_j$ are sufficiently
close, the data used in the local average $\hat f_h(x_i)$ will overlap with that
used in $\hat f_h(x_j)$. These overlaps lead to the approximation of $\hat\delta$
by U-statistics with kernels depending on $N$. The asymptotic normality of
$\hat\delta$ follows from results on the equivalence of such U-statistics to
(ordinary) sample averages. In a similar spirit, Powell, Stock and
Stoker (1987) obtain $\sqrt{N}$ convergence rates for the estimation of "density
weighted" average derivatives, and Robinson (1987) and Härdle and
Marron (1987) show how kernel densities can be used to obtain $\sqrt{N}$
convergence rates for certain parameters in specific semiparametric models. We
also note that our method of trimming follows Bickel (1982), Manski (1984) and
Robinson (1986).

For any given sample size, the bandwidth $h$ and the trimming bound $b$ can
be set to any (positive) values, so that their choice can be based entirely on
the small sample behavior of $\hat\delta$. The conditions (i)-(iii) of Theorem
3.1 indicate how the initial bandwidth and trimming bound must be decreased as
the sample size is increased. These conditions are certainly feasible: suppose
$h=h_0N^{-\eta}$ and $b=b_0N^{-\zeta}$; then (i)-(iii) are equivalent to

           $\eta > \zeta > 0$

           $\frac{1}{2p-2} < \eta < \frac{1-4\zeta-\varepsilon}{2k+2}$

Since $p > k+2$ and $\varepsilon$ is arbitrarily small, $\zeta$ can be chosen small
enough to fulfil the last condition.
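The equivalence can be verified by substituting the power rates into each condition of Theorem 3.1 and reading off the sign of the exponent of $N$. The following worked step uses only the quantities defined above.

```latex
\begin{aligned}
\text{(i)}\quad   & b = b_0 N^{-\zeta} \to 0 \iff \zeta > 0, \qquad
                    b^{-1}h = (h_0/b_0)\, N^{\zeta-\eta} \to 0 \iff \eta > \zeta; \\
\text{(ii)}\quad  & b^4 N^{1-\varepsilon} h^{2k+2}
                    = b_0^4 h_0^{2k+2}\, N^{\,1-\varepsilon-4\zeta-\eta(2k+2)} \to \infty
                    \iff \eta < \frac{1-4\zeta-\varepsilon}{2k+2}; \\
\text{(iii)}\quad & N h^{2p-2} = h_0^{2p-2}\, N^{\,1-\eta(2p-2)} \to 0
                    \iff \eta > \frac{1}{2p-2}.
\end{aligned}
```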

The bandwidth conditions arise as follows. Condition (ii) assures that
the estimator $\hat\delta$ can be "linearized" to one without an estimated
denominator, and is a sufficient condition for asymptotic normality.
Condition (iii) assures that the bias of $\hat\delta$ vanishes at rate
$\sqrt{N}$. Conditions (i)-(iii) are one-sided in implying that the trimming
bound $b$ cannot converge too quickly to 0 as $N\to\infty$, but rather must
converge slowly. The behavior of the bandwidth $h$ as $N\to\infty$ is bounded both
below and above by conditions (ii) and (iii).

Condition (iii) does imply that the pointwise convergence of $\hat f_h(x)$ to
$f(x)$ must be suboptimal. Stone (1980) shows that the optimal pointwise rate
of convergence under our conditions is $N^{-p/(2p+k)}$, and Collomb and
Härdle (1986) show that this rate is achievable with kernel density estimators
such as (3.2); for instance, by taking $h_{opt}=h_0N^{-1/(2p+k)}$. But we have
that $N h_{opt}^{2p-2}\to\infty$, which violates condition (iii), so that as
$N\to\infty$, $h$ must converge to 0 more quickly than $h_{opt}$. The reason for
this is that (iii) is a bias condition: as $N\to\infty$, the (pointwise) bias of
$\hat f_h(x)$ must vanish at a faster rate than its (pointwise) variance, for
the bias of $\hat\delta$ to be $o(N^{-1/2})$. In other words, for $\sqrt{N}$
consistent estimation of $\delta$, one must "undersmooth" the nonparametric
component $\hat\ell_h(x)$.


4.2 On Modeling Multiple Regression

The main implication of Theorem 3.3 is that the optimal one-dimensional
convergence rate is achievable in the estimation of
$g(x^T\delta)=E(y\,|\,x^T\delta)$, using $\hat\delta$ instead of $\delta$. We have
assumed that $m(x)$, and hence $g(x^T\delta)$, is twice differentiable, but this
plays no role except to affix the optimal rate at $N^{2/5}$. If $g(x^T\delta)$
is assumed to be differentiable of order $q$ and $K_1(\cdot)$ is a kernel of
order $q$, then the result is valid, where the optimal rate of convergence is
then $N^{q/(2q+1)}$. The attainment of optimal one-dimensional rates of
convergence is possible for the ADE method because the additive structure of
$g(x^T\delta)$ is sufficient to follow the "dimension reduction principle" of
Stone (1986). Alternative uses of additive structure can be found in Breiman
and Friedman (1985) and Hastie and Tibshirani (1986).

Finally, we turn to the connection between the ADE method and projection
pursuit regression (PPR) of Friedman and Stuetzle (1981). The first step of PPR
is to choose $\beta$ (normalized as a direction) and $g$ to minimize
$s(g,\beta)=\sum_i [y_i - g(x_i^T\beta)]^2$. Because any model of the form
$m(x)=\tilde g(x^T\beta)$ is fully inferred by
$\hat m(x)=\hat g_{h'}(x^T\hat\delta)$ at the optimal one dimensional rate of
convergence, one can correctly regard the ADE method as a version of
projection pursuit regression. However, for a general regression function
$m(x)$, $\hat g$ and $\hat\delta$ will not necessarily minimize the sum of squares
$s(g,\beta)$. The first order conditions of that minimization imply that given
$g$, $\beta$ is chosen such that $\{y_i-g(x_i^T\beta)\}$ is orthogonal to
$\{x_i\,g'(x_i^T\beta)\}$, which does not necessarily yield
$\beta=\hat\delta/|\hat\delta|$.

Given $\hat\delta$, the ADE method utilizes a local least squares estimator
$\hat g_{h'}$; in particular,
$\sum_i K_1[(z-x_i^T\hat\delta)/h']\,(y_i-t)^2$ is minimized by
$t=\hat g_{h'}(z)$. Moreover, $\hat\delta$ can be seen to be a type of least
squares estimator, as follows. Let $S_{\hat\ell}$ denote the sample moment
$S_{\hat\ell}=N^{-1}\sum_i \hat\ell_h(x_i)\hat\ell_h(x_i)^T\hat I_i$, and set
$\hat L_i=(S_{\hat\ell})^{-1}\hat\ell_h(x_i)\hat I_i$. Then $\hat\delta$ can easily
be seen to be the value of $d$ that minimizes the sum-of-squares
$\sum_i [y_i - \hat L_i^T d]^2$. Thus $\hat\delta$ is chosen such that
$\{y_i - \hat L_i^T\hat\delta\}$ is orthogonal to the subspace spanned by
$\{\hat L_i\}$, or equivalently, $(S_{\hat\ell})^{-1}\hat\delta$ represents the
coordinates of $\{y_i\}$ projected onto the subspace spanned by
$\{\hat\ell_h(x_i)\hat I_i\}$.

Consequently, when the true regression function is of the form
$m(x)=\tilde g(x^T\beta)$, ADE and PPR represent different methods of inferring
$m(x)$. The possible advantages of the ADE method arise from reduced
computational effort, since (given $h$, $b$ and $h'$)
$\hat m(x)=\hat g_{h'}(x^T\hat\delta)$ is computable directly from the data.
The first step of PPR will in principle estimate $m(x)$, but minimizing
$s(g,\beta)$ by an iterative numerical process (of checking all potential
directions $\beta$ and determining the optimal $g$ for each $\beta$) typically
involves considerable computational effort (although the results of
Ichimura (1987) may provide some improvement).

5. The Average Derivative Method In Practice

In this section we present the results of a simulation study designed to
study three issues of the ADE approach:

1. The performance of the estimators when the data is generated by a simple
   parametric model.

2. The ability of the ADE approach to nonparametrically capture structures
   in high dimensions.

3. The value of dimension reduction, as expressed through the one-dimensional
   rate of convergence of $\hat m(x)$.

The first issue is studied by generating data from a true linear regression
model and estimating the coefficients by the ADE method. The second issue is
studied using data generated by a highly nonlinear regression model (a sine
wave). The third issue is addressed by comparing the final ADE regression
estimator with a multivariate nonparametric regression smoother. The
simulations utilize k=4 predictor variables and sample sizes of N=50 and
N=100.

Our Monte Carlo experience indicated that for relatively small sample
sizes, better performance of $\hat\delta$ was obtained using a standard positive
kernel instead of the higher order kernel prescribed by Theorem 3.1 (for k=4, a
kernel of order p=6 is indicated). In particular, the order of the kernel
affects the tradeoff between bias and variance, with the higher order kernel
required for the bias to vanish at rate $\sqrt{N}$. However, for sample sizes in
the range of N=100, we found the small sample bias-variance tradeoff quite
pronounced: using a standard positive kernel instead of a higher order kernel
implied a slightly higher bias in $\hat\delta$ but a substantially smaller
variance in most cases.

For the estimator $\hat\delta$, the kernel function $K$ was constructed as the
product of one dimensional kernels, namely
$K(u_1,\ldots,u_k)=\prod_{j=1}^{k}K_1(u_j)$, where $K_1$ is taken to be the
univariate "biweight" kernel:

           $K_1(u) = (15/16)\,(1 - u^2)^2\, I(|u| \leq 1)$

For estimating $g$, we also utilized the biweight kernel.

Our theoretical results do not constrain the choice of bandwidth $h$ or
trimming bound $b$ for a given sample size, and we study the behavior of
$\hat\delta$ for different bandwidth values. We did find that somewhat better
estimator behavior resulted when a slightly higher bandwidth value was used to
estimate density derivatives than the density value, and so results on
$\hat\delta$ for bandwidth $h$ have used density derivative estimates computed
using $h^*=1.25h$ (so that $\hat\ell_h=-\hat f'_{h^*}/\hat f_h$). This does not
affect the asymptotic approximation theory of Section 3, and is suggested by
the slower pointwise convergence rates applicable to the estimation of
derivatives (relative to estimation of density levels). Finally, because the
value of the trimming bound $b$ is not easily interpreted, we adopted a
trimming rule suggested by Ray Carroll, to set $b$ so that a given percentage
$\alpha$ of the data was dropped in each sample (we utilize the values
$\alpha=5\%$ and $\alpha=1\%$ below, although the results were not particularly
sensitive to the value of $\alpha$).
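The percentage trimming rule can be sketched as choosing $b$ as a sample quantile of the estimated density values. This is an assumed implementation of the rule described above, not the authors' code; the function name is hypothetical and the density values are a toy stand-in for $\hat f_h(x_i)$.

```python
import numpy as np

def trimming_bound(f_hat, alpha):
    """Bound b such that roughly a fraction alpha of observations has f_hat <= b."""
    return float(np.quantile(f_hat, alpha))

# Toy density values 1, 2, ..., 100 (assumption, for illustration only).
f_values = np.arange(1.0, 101.0)
b = trimming_bound(f_values, alpha=0.05)
kept = f_values > b                      # indicator I[f_hat(x_i) > b]
n_dropped = int((~kept).sum())
```

With ties or very small samples the dropped share can differ slightly from $\alpha$, since the quantile is interpolated between order statistics.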

5.1 Average Derivative Estimation with a Simple Parametric Model

It is natural to ask how a method that accounts for nonlinearity behaves
when the sample is generated by a simple parametric model. In order to study
this question we generated samples of size N=50 from the linear model

(5.1)      $y_i = \sum_{j=1}^{4} j\, x_{ji} + (.05)\,e_i$

so that the coefficient of $x_{ji}$ is $j$, where $x_{1i},\ldots,x_{4i}$ and
$e_i$ are independent standard normal variables. Here the true average
derivative is $\delta=(1,2,3,4)$.

Table 1 gives the means and standard deviations of $\hat\delta$ over 100 Monte
Carlo samples. The best behavior for $\hat\delta$ was found using the bandwidth
h=1.5, and we include results using a low bandwidth h=1.0 and a high bandwidth
h=2.0 for comparison. We also include the means of the trimming bound $b$ over
the samples. While there is not a large amount of "noise" in this design (the
standard deviation of the additive disturbance is .05), we find the
performance of $\hat\delta$ to be good, using 50 observations to estimate k=4
coefficients.

5.2 Average Derivative Estimation with Nonlinear Structure

In order to study the behavior of the ADE estimators in a reasonably
challenging setting, we first generated samples of size N=100 according to the
model

(5.2)      $y_i = \sin\!\left(\sum_{j=1}^{4} x_{ji}\right) + e_i$

where $x_{1i},\ldots,x_{4i}$ and $e_i$ are standard normal variables as before.
For this model we have $m'=\cos(\sum_j x_{ji})\,(1,1,1,1)^T$, so that
$\delta=E(m')$ takes the form $\delta=\delta_0\,(1,1,1,1)^T$ with
$\delta_0=.135$ (using formulae given in Bronstein-Semendjajew (1974, p. 350)).
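The value of $\delta_0$ can be recovered in one line from the characteristic function of the normal distribution. Writing $S=\sum_{j=1}^{4}x_{ji}$, a sum of four independent standard normal variables, the worked step is:

```latex
S \sim N(0,4), \qquad
\delta_0 = E[\cos S] = \operatorname{Re} E\!\left[e^{iS}\right]
         = e^{-\mathrm{Var}(S)/2} = e^{-2} \approx 0.1353 .
```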

Table 2 gives the means and standard deviations of the components of $\hat\delta$ over 100 Monte Carlo samples. Here we found the coefficients to be well-estimated using bandwidth h=.9, and again we include low and high bandwidth results for comparison. Finally, we present the means and standard deviations of the "estimator" $\hat\delta^{+}$ that uses the true log-density derivative $\ell(x)$ instead of the estimator $\hat\ell_h(x)$ (namely $\hat\delta^{+}=N^{-1}\sum_i y_i\,\ell(x_i)$, where $\ell(x_i)=x_i$ for x distributed normally with mean 0 and identity variance-covariance matrix).
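
The "known density" statistic is simple enough to simulate directly. The sketch below draws 100 samples of size N=100 from model (5.2) and averages $y_i x_i$, using $\ell(x)=x$ for standard normal regressors. The function name `known_density_ade` and the seed are illustrative, so the printed numbers will resemble but not reproduce the last column of Table 2.

```python
import numpy as np


def known_density_ade(x, y):
    """delta^+ = N^{-1} sum_i y_i l(x_i), with l(x) = x for standard normal x."""
    return (x * y[:, None]).mean(axis=0)


rng = np.random.default_rng(1)
n_obs, n_rep, k = 100, 100, 4
estimates = np.empty((n_rep, k))
for r in range(n_rep):
    x = rng.standard_normal((n_obs, k))
    y = np.sin(x.sum(axis=1)) + rng.standard_normal(n_obs)   # sine model (5.2)
    estimates[r] = known_density_ade(x, y)

print("mean  :", estimates.mean(axis=0))
print("s.d.  :", estimates.std(axis=0, ddof=1))
print("target:", np.exp(-2.0))
```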

It is clear that for the bandwidth h=.9 the estimator $\hat\delta$ performs quite well, exhibiting comparable behavior to the "known density" statistic $\hat\delta^{+}$. The substantial variation of $\hat\delta^{+}$ is indicative of the high level of noise in the basic design (the standard error of 1 for e). At any rate, we find the behavior of $\hat\delta$ to be encouraging, given k=4 predictor variables and N=100 data points.

Although not reported, we also performed simulations for sample size N=400. We found good behavior of $\hat\delta$ for somewhat smaller bandwidth values (in the range of h=.6), as well as diminished standard errors consistent with the quadrupling of the sample size.

5.3  Dimension  Reduction  of  Average  Derivative  Estimation 

The main consequence of Theorem 3.3 is that the ADE regression estimator exhibits one-dimensional rates of convergence. We study this effect by comparing the ADE regression estimator $\hat m(x)=\hat g_{h'}(x^T\hat\delta)$ to a high dimensional smoother, namely the Nadaraya-Watson kernel regression estimator $\hat m_{h''}(x)$. We utilize the data sets of size N=100 from the sine model (5.2). We utilize the bandwidths h=.9 for computing $\hat\delta$ (and α=5%), and h''=1.3 for the multivariate smoother $\hat m_{h''}$. For computing $\hat g_{h'}$, we utilize normalized coefficients $\hat\delta^*=\hat\delta/|\hat\delta|$, and bandwidth h'=.3. The values of h'' and h' were determined by cross validation applied to the first few Monte Carlo samples (c.f. Härdle and Marron (1985)).

We begin by studying the overall fit of the ADE estimator versus the multivariate smoother. For this, we compute the difference in the average squared error defined as

$\mathrm{DIF} = \mathrm{ASE}_{\hat m_{h''}} - \mathrm{ASE}_{\hat m}$

where

$\mathrm{ASE}_{\hat m_{h''}} = N^{-1}\sum_{i=1}^{N}[m(x_i)-\hat m_{h''}(x_i)]^2$ ;  $\mathrm{ASE}_{\hat m} = N^{-1}\sum_{i=1}^{N}[m(x_i)-\hat g_{h'}(x_i^T\hat\delta)]^2$

Figure 1 shows the variable DIF over 50 Monte Carlo samples. It is apparent that the ADE estimator displays smaller average squared error in the majority of cases, with DIF exceeding .1 for 25% of the samples, and less than -.1 for only 10% of the samples.
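
The DIF comparison can be sketched as follows. This is a simplified illustration, not the experiment behind Figure 1: both smoothers use a Gaussian kernel (an assumption), and the one-dimensional smoother is run on the true index $z_i=x_i^T\delta^*$ rather than on an estimated $\hat\delta$. The names `nadaraya_watson` and `average_squared_error` are hypothetical helpers.

```python
import numpy as np


def nadaraya_watson(x_train, y_train, x_eval, h):
    """Nadaraya-Watson smoother with a Gaussian product kernel (an assumed kernel)."""
    u = (x_eval[:, None, :] - x_train[None, :, :]) / h
    w = np.exp(-0.5 * np.sum(u ** 2, axis=2))
    return (w @ y_train) / w.sum(axis=1)


def average_squared_error(m_true, m_fit):
    return np.mean((m_true - m_fit) ** 2)


rng = np.random.default_rng(2)
direction = 0.5 * np.ones(4)                       # delta* = .5(1,1,1,1)
dif = []
for _ in range(20):
    x = rng.standard_normal((100, 4))
    m = np.sin(x.sum(axis=1))                      # true regression of model (5.2)
    y = m + rng.standard_normal(100)
    z = (x @ direction)[:, None]                   # one-dimensional index
    fit_index = nadaraya_watson(z, y, z, h=0.3)    # smoother on the index
    fit_multi = nadaraya_watson(x, y, x, h=1.3)    # smoother on all four regressors
    dif.append(average_squared_error(m, fit_multi) - average_squared_error(m, fit_index))
dif = np.array(dif)
print(dif.mean(), (dif > 0).mean())                # mean DIF and share of samples with DIF > 0
```

Positive DIF means the one-dimensional smoother fits the true regression better on that sample; the sizes obtained here depend on the kernel choice and should not be compared with the histogram in Figure 1.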

In addition to the gains in statistical fit from dimension reduction, we have argued that the ADE method is attractive because it permits a graphical depiction (a graph of $\hat g_{h'}$) of the nonlinearity exhibited by the data. Figures 2 through 5 illustrate this feature of the approach, as applied to a single Monte Carlo sample from model (5.2).

Figure 2 shows the basic data, by plotting $y_i$ against the true weighted sum $z_i=x_i^T\delta^*$, where $\delta^*=.5(1,1,1,1)$, as well as a plot of the ADE regression estimator $\hat g_{h'}$ of the regression of $y_i$ on the $\hat z_i=x_i^T\hat\delta^*$, where for this example, $\hat\delta^*=(.54,.47,.59,.35)$. While the data show considerable noise, the basic sine wave pattern is evidenced by both the data and the ADE regression estimator.

Figure 3 displays the plot of the ADE regression estimator $\hat g_{h'}$, as well as the plot of the true regression $m(x_i)$ against $z_i$. Figure 4 displays $\hat g_{h'}$ as well as the plot of the kernel smoother $\tilde g_{h'}$ of the regression of $y_i$ on the true $z_i$ values (using h'=.3). These two figures point out that the ADE regression estimator tracks the true regression reasonably well, and tracks the smoother based on knowledge of the true $\delta$ quite well.

Finally, Figure 5 contains a plot of the multivariate smoother $\hat m_{h''}(x_i)$ against $z_i$, and the true regression $m(x_i)$ against $z_i$. While one can discern a sine pattern from the multivariate smoother, it yields a considerably less smooth graphical depiction of the relationship. Moreover, the jagged feature of the plot is an artifact of plotting $\hat m_{h''}(x_i)$ against $z_i=x_i^T\delta^*$, as in fact, the estimated function $\hat m_{h''}$ is quite smooth. We experimented with several values of h'', with larger values of h'' moving the jagged plot toward the horizontal, and smaller values of h'' increasing the heights of the peaks, each obliterating the visual tendency to a sinusoidal shape. In this sense, we regard Figure 5 as giving the best showing for the multivariate smoother technique.

6.  Concluding  Remarks 

In  this  paper  we  have  argued  that  the  ADE  method  represents  a  useful  yet 
flexible  tool  for  studying  general  regression  relationships.  At  its  center  is 
the  estimation  of  average  derivatives,  which  we  propose  as  sensible 
coefficients  for  measuring  the  relative  impacts  of  separate  predictor 
variables  on  the  mean  response.  As  such,  we  propose  the  ADE  method  as  a 
natural outgrowth of linear modeling, or "running (OLS) regressions," as a
useful  method  of  data  summarization. 

As  a  procedure  for  inferring  regression,  the  ADE  method  displays  one- 
dimensional  rates  of  convergence,  and  as  such  achieves  theoretical  dimension 
reduction.  But  of  greater  practical  importance  is  the  economy  achieved  with 
respect  to  data  summarization.  Instead  of  facing  the  nontrivial  task  of 
interpreting  a  fully  flexible  nonparametric  regression,  the  ADE  method  permits 
the  "significance"  of  individual  predictor  variables  to  be  judged  via  simple 
hypothesis  tests  on  the  value  of  the  average  derivatives.  Moreover,  remaining 
nonlinearity can be assessed by looking at a graphical picture of the function $\hat g_{h'}$.

While  our  simulations  of  the  ADE  approach  are  encouraging,  there  are  many 
future  research  topics  suggested  by  our  results.  For  instance,  there  are 
numerous  practical  questions  concerning  choice  of  bandwidth  for  the  average 
derivative  estimator,  such  as  how  to  set  the  bandwidth  optimally.  While  we 
have seen some basic tendencies in the Monte Carlo simulations, practical problems are, of course, never so stylized. The same remarks apply to the rules for trimming, although in our Monte Carlo experience we found no substantial effect of varying the trimming rule.

The ADE estimators are implemented as part of the exploratory data software package XploRe of Härdle (1987). In addition, David Scott of Rice University has written a very efficient code (for the S package) for computing the ADE coefficients, which is available from the authors.

Appendix: Assumptions and Proofs of Theorems

Assumptions for Theorem 3.1.

1. The support Ω of f is a convex, possibly unbounded subset of $R^k$ with nonempty interior. The underlying measure of (y,x) can be written as $\nu_y\times\nu_x$, where $\nu_x$ is Lebesgue measure.

2. f(x)=0 for all $x\in\partial\Omega$, where $\partial\Omega$ is the boundary of Ω.

3. m(x)=E(y|x) is continuously differentiable on $\bar\Omega\subset\Omega$, where $\bar\Omega$ differs from Ω by a set of measure 0.

4. The moments $E(\ell^T\ell\,y^2)$ and $E[(m')^T(m')]$ exist.

5. All derivatives of f(x) of order p exist, where p>k+2.

6. The kernel function K(u) has bounded support $S=\{u : |u|\le 1\}$, is symmetric, K(u)=0 for $u\in\partial S=\{u : |u|=1\}$, and is of order p:

$\int K(u)\,du = 1$

$\int u_1^{\ell_1}\cdots u_k^{\ell_k}K(u)\,du = 0$,  $0<\ell_1+\dots+\ell_k<p$

$\int u_1^{\ell_1}\cdots u_k^{\ell_k}K(u)\,du \ne 0$,  $\ell_1+\dots+\ell_k=p$

(Such kernels are easily constructed by multiplication of one dimensional kernels - see Gasser, Müller and Mammitzsch (1985) for examples of such one dimensional kernel functions).
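
As an illustration of the moment conditions in Assumption 6, the sketch below checks a familiar fourth-order polynomial kernel on [-1, 1], $K(u)=\tfrac{15}{32}(3-10u^2+7u^4)$, and forms a product kernel from it. The choice of this particular kernel is an assumption made for the example, not necessarily the kernel used in the paper; note also that a product kernel is supported on the cube $\{|u_j|\le 1\}$ rather than on the unit ball.

```python
import numpy as np
from numpy.polynomial import Polynomial

# K(u) = (15/32) (3 - 10 u^2 + 7 u^4) on [-1, 1]; it vanishes at u = -1 and u = 1.
K4 = (15.0 / 32.0) * Polynomial([3.0, 0.0, -10.0, 0.0, 7.0])


def moment(kernel, order):
    """Integral of u**order * kernel(u) over [-1, 1], exact for polynomial kernels."""
    monomial = Polynomial([0.0] * order + [1.0])
    antiderivative = (monomial * kernel).integ()
    return antiderivative(1.0) - antiderivative(-1.0)


def product_kernel(u):
    """Product kernel K(u_1) * ... * K(u_k) evaluated at the rows of a 2-d array u."""
    inside = np.all(np.abs(u) <= 1.0, axis=1)
    return np.prod(K4(u), axis=1) * inside


for order in range(5):
    print(order, moment(K4, order))   # moment 0 is 1, moments 1-3 vanish, moment 4 does not
```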

7. The functions f(x) and m(x) obey the following Lipschitz conditions: for v in an open neighborhood of 0, there exist functions $\omega_f$, $\omega_{f'}$, $\omega_{m'}$ and $\omega_{\ell m}$ such that

$|f(x+v)-f(x)| < \omega_f(x)\,|v|$

$|f'(x+v)-f'(x)| < \omega_{f'}(x)\,|v|$

$|m'(x+v)-m'(x)| < \omega_{m'}(x)\,|v|$

$|\ell(x+v)m(x+v)-\ell(x)m(x)| < \omega_{\ell m}(x)\,|v|$

with $E[(\ell y\,\omega_f)^2]<\infty$, $E[(y\,\omega_{f'})^2]<\infty$, $E[\omega_{m'}^2]<\infty$ and $E[\omega_{\ell m}^2]<\infty$. $M_2(x)=E(y^2|x)$ is continuous in x.


Let $A_N=\{x \mid f(x)>b\}$ and $B_N=\Omega\setminus A_N$.

8. As $N\to\infty$,

$\int_{B_N} m(x)f'(x)\,dx = o(N^{-1/2})$

9. If $f^{(p)}$ denotes any p-th order partial derivative of f, then $f^{(p)}$ is locally Hölder continuous: there exist c(x) and γ>0 such that $|f^{(p)}(x+v)-f^{(p)}(x)|<c(x)|v|^{\gamma}$. The p+γ moments of K(·) exist. Moreover

$\int_{A_N} m(x)f^{(p)}(x)\,dx < M < \infty$

$h^{\gamma}\int_{A_N} c(x)m(x)\,dx < M < \infty$

$h\int_{A_N} m(x)\ell(x)f^{(p)}(x)\,dx < M < \infty$

$h^{\gamma+1}\int_{A_N} c(x)m(x)\ell(x)\,dx < M < \infty$


Assumptions  8  and  9  are  conditions  on  the  behavior  of  m(x)  and  f(x)  in 
the  tails  of  the  distribution.  For  Assumption  8,  some  sufficient  conditions 
for the univariate case k=1 are as follows, with multivariate analogues easy
to  derive. 


The support Ω is the real line R. Let $L_N=\inf\{|x| : f(x)<b\}$, so that $L_N\to\infty$. Consider the situation where $|m(x)f'(x)|<C\,G(x)$, where G(x) is a density. Then

$\int_{B_N} m(x)f'(x)\,dx \le C\int_{B_N}G(x)\,dx \le C\,\mathrm{Prob}_G\{|x|\ge L_N\} \le C\,E_G(|x|^d)/L_N^d$

Therefore, if G(x) has absolute moments of order d, where d is such that $\sqrt N\,L_N^{-d}\to 0$, then Assumption 8 obtains. A sufficient condition is that G(x) is a normal density.


The support Ω is bounded. The density f(x) must be smooth and flat near the boundary. Suppose that $\Omega=[0,x_0]$, f(0)=0, $f(L_N)=b$, f(x) is increasing over $[0,L_N]$, and m(x) is bounded over $[0,L_N]$. If we expand f'(x) in a Taylor series about 0, and assume that $f'(0)=f^{(2)}(0)=\dots=f^{(q-1)}(0)=0$, then

$\int_{[0,L_N]} f'(x)\,dx = O(L_N^{\,q})$

Consequently, if q is large enough such that $\sqrt N\,L_N^{\,q}\to 0$, then Assumption 8 is valid.

Assumption 9 is written in a weak form required for Theorem 3.1. For instance, Assumption 9 is clearly valid if the support Ω is bounded and m(x) is bounded.


To prove Theorem 3.3, we also require that m(x)=E(y|x) is twice differentiable for all x in the interior of Ω.

Proofs of the Main Results

We begin with two preliminary remarks. First, equation (3.1) follows from integration by parts as

$\delta = \int m'(x)f(x)\,dx = -\int m(x)f'(x)\,dx = \int m(x)\ell(x)f(x)\,dx = E(\ell y)$

where the boundary terms vanish by Assumption 2 (c.f. Beran (1977), Stoker (1986)) and the last equality is by iterated expectation.
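
The integration-by-parts identity can be checked numerically in a simple case. The sketch below assumes a scalar standard normal x, for which $\ell(x)=x$, and the illustrative regression $m(x)=\sin x$; both expectations are then computed by Gauss-Hermite quadrature and should agree with $e^{-1/2}$.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite rule for the weight exp(-x^2 / 2); dividing by sqrt(2*pi)
# turns weighted sums into expectations under the standard normal density.
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2.0 * np.pi)

m = np.sin(nodes)             # m(x) = sin(x), an illustrative regression function
m_prime = np.cos(nodes)       # m'(x)
score = nodes                 # l(x) = -f'(x)/f(x) = x for the standard normal density

lhs = np.sum(weights * m_prime)      # E[m'(x)]
rhs = np.sum(weights * score * m)    # E[l(x) m(x)], which equals E[l(x) y]
print(lhs, rhs, np.exp(-0.5))
```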

Second, because of condition (iii), as $N\to\infty$, the pointwise mean square errors of $\hat f_h$ and $\hat f_h'$ are dominated by their variances (as discussed in Section 4.1). Therefore, since the set $\{x \mid f(x)\ge b\}$ is compact and $b^{-1}h\to 0$, we can assert (c.f. Collomb and Härdle (1986), Silverman (1979)):

(A.1a)  $\sup|\hat f_h(x)-f(x)|\,I[f(x)>b] = O_p[(N^{1-(\epsilon/2)}h^{k})^{-1/2}]$

(A.1b)  $\sup|\hat f_h'(x)-f'(x)|\,I[f(x)>b] = O_p[(N^{1-(\epsilon/2)}h^{k+2})^{-1/2}]$

for any ε>0.

Now define two (unobservable) "estimators" which are related to $\hat\delta$. First, define the estimator $\bar\delta$ based on trimming with respect to the true density value:

(A.2)  $\bar\delta = N^{-1}\sum_{i=1}^{N}\hat\ell_h(x_i)\,y_i\,I_i$

where $I_i=I[f(x_i)>b]$, $i=1,\dots,N$. Next, define a linearization $\tilde\delta$:

(A.3)  $\tilde\delta = N^{-1}\sum_{i=1}^{N}\tilde\ell_h(x_i)\,y_i\,I_i$

where

(A.4)  $\tilde\ell_h = \ell - \dfrac{\hat f_h'-f'}{f} - \dfrac{\hat f_h-f}{f}\,\ell$

Proof of Theorem 3.1: The proof consists of four distinct steps, summarized as:

Step 1. Linearization: $\sqrt N(\bar\delta-\tilde\delta)=o_p(1)$.
Step 2. Asymptotic Normality: $\sqrt N[\tilde\delta-E(\tilde\delta)]$ has a limiting normal distribution with mean 0 and variance Σ.
Step 3. Bias: $[E(\tilde\delta)-\delta]=o(N^{-1/2})$.
Step 4. Trimming: $\sqrt N(\hat\delta-\delta)$ has the same limiting distribution as $\sqrt N(\bar\delta-\delta)$.

The combination of Steps 1-4 yields Theorem 3.1.


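Step 1. Linearization: Some arithmetic gives

$\sqrt N(\bar\delta-\tilde\delta) = N^{-1/2}\sum_{i}\dfrac{[\hat f_h(x_i)-f(x_i)]\,[\hat f_h'(x_i)-f'(x_i)]}{\hat f_h(x_i)\,f(x_i)}\,y_iI_i \;+\; N^{-1/2}\sum_{i}\dfrac{[f(x_i)-\hat f_h(x_i)]^2}{\hat f_h(x_i)\,f(x_i)}\,\ell(x_i)\,y_iI_i$

so that by (A.1a), there is a constant $c_f$ such that with high probability

$|\sqrt N(\bar\delta-\tilde\delta)| \le \dfrac{\sqrt N}{b^2-b\,c_f(N^{1-(\epsilon/2)}h^k)^{-1/2}}\;\sup|(\hat f_h-f)I|\;\sup|(\hat f_h'-f')I|\;\Bigl[N^{-1}\sum_i|y_i|I_i\Bigr]$
$\qquad +\; \dfrac{\sqrt N}{b^2-b\,c_f(N^{1-(\epsilon/2)}h^k)^{-1/2}}\;\bigl[\sup|(\hat f_h-f)I|\bigr]^2\;\Bigl[N^{-1}\sum_i|\ell(x_i)y_i|I_i\Bigr]$

The terms $N^{-1}\sum_i|y_i|I_i$ and $N^{-1}\sum_i|\ell(x_i)y_i|I_i$ are bounded in probability by Chebyshev's inequality. Consequently, from (A.1a,b) we have that

$\sqrt N\,|\bar\delta-\tilde\delta| = O_p\bigl(b^{-2}N^{-(1/2)+(\epsilon/2)}h^{-(2k+2)/2}\bigr) = o_p(1)$

since $b^2N^{1-(\epsilon/2)}h^k\to\infty$ and $b^4N^{1-\epsilon}h^{2k+2}\to\infty$ by condition (ii).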
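
The "arithmetic" behind Step 1 is the exact identity $\hat\ell_h-\tilde\ell_h=[(\hat f_h-f)(\hat f_h'-f')+(\hat f_h-f)^2\ell]/(\hat f_h f)$, with $\hat\ell_h=-\hat f_h'/\hat f_h$ and $\tilde\ell_h$ as in (A.4). The sketch below verifies it numerically on arbitrary scalar values; the random inputs are placeholders standing in for density values and their estimates, not output from any estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
f = rng.uniform(0.5, 2.0, size=1000)             # stand-ins for true density values
f_prime = rng.normal(size=1000)                  # stand-ins for true derivative values
f_hat = f + rng.uniform(-0.2, 0.2, size=1000)    # perturbed "estimates", kept positive
f_prime_hat = f_prime + rng.normal(scale=0.2, size=1000)

score = -f_prime / f                             # l = -f'/f
score_hat = -f_prime_hat / f_hat                 # l_hat = -f_hat'/f_hat
score_lin = score - (f_prime_hat - f_prime) / f - (f_hat - f) / f * score   # (A.4)

remainder = score_hat - score_lin
closed_form = ((f_hat - f) * (f_prime_hat - f_prime)
               + (f_hat - f) ** 2 * score) / (f_hat * f)
print(np.max(np.abs(remainder - closed_form)))   # zero up to rounding error
```

Because both pieces of the remainder are products of two estimation errors, the sup-norm rates (A.1a) and (A.1b) multiply, which is what drives the $o_p(1)$ conclusion.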


Step 2: Asymptotic Normality: Write the linearization $\tilde\delta$ as

$\tilde\delta = \tilde\delta_0+\tilde\delta_1+\tilde\delta_2$

where

$\tilde\delta_0 = N^{-1}\sum_i \ell(x_i)\,y_iI_i$

$\tilde\delta_1 = -N^{-1}\sum_i \dfrac{\hat f_h'(x_i)}{f(x_i)}\,y_iI_i$

$\tilde\delta_2 = -N^{-1}\sum_i \dfrac{\hat f_h(x_i)}{f(x_i)}\,\ell(x_i)\,y_iI_i$

We show that $\sqrt N[\tilde\delta-E(\tilde\delta)]$ has a limiting normal distribution, by showing that $\tilde\delta_0$, $\tilde\delta_1$ and $\tilde\delta_2$ are $\sqrt N$ equivalent to (ordinary) sample averages, and appealing to standard central limit theory. Throughout this section, we denote $(y_i,x_i)=v_i$. For $\tilde\delta_0$, we have that

(A.5)  $\sqrt N[\tilde\delta_0-E(\tilde\delta_0)] = N^{-1/2}\sum_{i=1}^{N}\{r_0(v_i)-E[r_0(v)]\} + o_p(1)$

where $r_0(v)=\ell(x)y$, since $\mathrm{Var}(\ell y)$ exists and $b\to 0$ as $N\to\infty$.

To analyze $\tilde\delta_1$ and $\tilde\delta_2$, we approximate them by U-statistics. The U-statistic related to $\tilde\delta_1$ can be written as

$U_1 = \binom{N}{2}^{-1}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} p_{1N}(v_i,v_j)$

with

$p_{1N}(v_i,v_j) = -\dfrac12\Bigl(\dfrac1h\Bigr)^{k+1}K'\Bigl(\dfrac{x_i-x_j}{h}\Bigr)\Bigl[\dfrac{y_iI_i}{f(x_i)}-\dfrac{y_jI_j}{f(x_j)}\Bigr]$

where $K'=\partial K/\partial u$. Note that by symmetry of K(·), we have

$\sqrt N[\tilde\delta_1-E(\tilde\delta_1)] = \sqrt N[U_1-E(U_1)] - N^{-1}\{\sqrt N[U_1-E(U_1)]\}$

The second term in this expansion will converge in probability to zero provided $\sqrt N[U_1-E(U_1)]$ has a limiting distribution, which we show later. Consequently, we have

(A.6)  $\sqrt N[\tilde\delta_1-E(\tilde\delta_1)] = \sqrt N[U_1-E(U_1)] + o_p(1)$
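
The relation between $\tilde\delta_1$ and its U-statistic, $\tilde\delta_1=(1-N^{-1})U_1$, holds exactly whenever $K'$ is odd with $K'(0)=0$. The sketch below checks this numerically in one dimension. It uses a Gaussian kernel and sets all trimming indicators to one; both are simplifying assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, h = 30, 1, 0.7
x = rng.standard_normal(n)
y = np.sin(x) + rng.standard_normal(n)
f = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)        # true density at x_i
a = y / f                                               # y_i I_i / f(x_i) with I_i = 1


def k_prime(u):
    """Derivative of the Gaussian kernel; odd in u, so k_prime(0) = 0."""
    return -u * np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


u = (x[:, None] - x[None, :]) / h
f_prime_hat = k_prime(u).sum(axis=1) / (n * h ** (k + 1))
delta_1 = -np.mean(f_prime_hat * a)                     # -N^{-1} sum_i f_hat'(x_i) y_i I_i / f(x_i)

p = -0.5 * h ** -(k + 1) * k_prime(u) * (a[:, None] - a[None, :])   # p_{1N}(v_i, v_j)
upper = np.triu_indices(n, 1)
u_stat = p[upper].mean()                                # average over the pairs i < j

print(delta_1, (1.0 - 1.0 / n) * u_stat)
```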


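The U-statistic related to $\tilde\delta_2$ is

$U_2 = \binom{N}{2}^{-1}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} p_{2N}(v_i,v_j)$

where

$p_{2N}(v_i,v_j) = -\dfrac12\Bigl(\dfrac1h\Bigr)^{k}K\Bigl(\dfrac{x_i-x_j}{h}\Bigr)\Bigl[\dfrac{\ell(x_i)y_iI_i}{f(x_i)}+\dfrac{\ell(x_j)y_jI_j}{f(x_j)}\Bigr]$

$U_2$ is related to $\tilde\delta_2$ via

$\sqrt N[\tilde\delta_2-E(\tilde\delta_2)] = \sqrt N[U_2-E(U_2)] - N^{-1}\{\sqrt N[U_2-E(U_2)]\} - \Bigl[\dfrac{1}{Nh^k}\Bigr]K(0)\Bigl\{N^{-1/2}\sum_{i=1}^{N}\Bigl[\dfrac{\ell(x_i)y_iI_i}{f(x_i)}-E\Bigl(\dfrac{\ell(x_i)y_iI_i}{f(x_i)}\Bigr)\Bigr]\Bigr\}$

As above, the second term of the expansion will converge in probability to zero provided $\sqrt N[U_2-E(U_2)]$ has a limiting distribution, as shown later. The third term converges in probability to zero, because its variance is bounded by $K(0)^2(1/Nh^{k+1})^2(h/b)^2E[(\ell(x_i)y_i)^2I_i]=o(1)$, since $Nh^{k+1}\to\infty$ and $h/b\to 0$. Therefore, we have

(A.7)  $\sqrt N[\tilde\delta_2-E(\tilde\delta_2)] = \sqrt N[U_2-E(U_2)] + o_p(1)$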

The analysis of $U_1$ and $U_2$ is quite similar, so we only present the details for $U_1$. We note that $U_1$ is a U-statistic with varying kernel (for instance, see Nolan and Pollard (1987)), since $p_{1N}$ depends on N through the bandwidth h. Asymptotic normality of $U_1$ can be shown using Lemma 3.1 of Powell, Stock and Stoker (1987), which states that if $E[|p_{1N}(v_i,v_j)|^2]=o(N)$, then

(A.8)  $\sqrt N[U_1-E(U_1)] = N^{-1/2}\sum_{i=1}^{N}\{r_{1N}(v_i)-E[r_{1N}(v)]\} + o_p(1)$

where $r_{1N}(v)=2E[p_{1N}(v,v_j)\mid v]$.


This condition is implied by (ii): let $M_1(x_i)=E(y_iI_i\mid x_i)$ and $M_2(x_i)=E(y_i^2I_i\mid x_i)$, then

$E[|p_{1N}(v_i,v_j)|^2] \le \dfrac{1}{4b^2h^{2k+2}}\int\!\!\int\Bigl|K'\Bigl(\dfrac{x_i-x_j}{h}\Bigr)\Bigr|^2[M_2(x_i)+M_2(x_j)-2M_1(x_i)M_1(x_j)]\,f(x_i)f(x_j)\,dx_i\,dx_j$

$\quad = \dfrac{1}{4b^2h^{k+2}}\int\!\!\int|K'(u)|^2[M_2(x_i)+M_2(x_i+hu)-2M_1(x_i)M_1(x_i+hu)]\,f(x_i)f(x_i+hu)\,dx_i\,du$

$\quad = O(b^{-2}h^{-(k+2)}) = O[N(b^2Nh^{k+2})^{-1}] = o(N)$

since $b^2Nh^{k+2}\to\infty$ is implied by condition (ii). Therefore, (A.8) is valid.

Now let $b^*=\sup_{x,u}\{f(x+hu)\mid f(x)=b \text{ and } |u|\le 1\}$ and $I_i^*=I[f(x_i)>b^*]$. Note that by construction, if u is such that $|u|\le 1$, then $I[f(x_i+hu)>b]-I_i^*\ne 0$ only when $(1-I_i^*)=1$, and that $b^*\to 0$, $h/b^*\to 0$ as $b\to 0$ and $h\to 0$. Now write $r_{1N}(v_i)=E[2p_{1N}(v_i,v_j)\mid v_i]$ as:

$r_{1N}(v_i) = -\Bigl(\dfrac1h\Bigr)^{k+1}\int K'\Bigl(\dfrac{x_i-x}{h}\Bigr)\Bigl[\dfrac{y_iI_i}{f(x_i)}-\dfrac{m(x)I[f(x)>b]}{f(x)}\Bigr]f(x)\,dx$

$\quad = \dfrac{y_iI_i}{f(x_i)}\int\dfrac1h K'(u)\,f(x_i+hu)\,du \;-\; I_i^*\int\dfrac1h K'(u)\,m(x_i+hu)\,du \;-\; (1-I_i^*)\int\dfrac1h K'(u)\,m(x_i+hu)\{I[f(x_i+hu)>b]-I_i^*\}\,du$

$\quad = -\dfrac{y_iI_i}{f(x_i)}\int K(u)\,f'(x_i+hu)\,du \;+\; I_i^*\int K(u)\,m'(x_i+hu)\,du \;+\; (1-I_i^*)\,a(x_i;h,b)$

where

$a(x_i;h,b) = -\int\dfrac1h K'(u)\,m(x_i+hu)\{I[f(x_i+hu)>b]-I_i^*\}\,du$


Now, if $r_1(v_i)$ is defined as

$r_1(v_i) = \ell(x_i)y_i + m'(x_i)$

then the difference between $r_{1N}(v_i)$ and $r_1(v_i)$ is

$t_{1N}(v_i) = -\dfrac{y_iI_i}{f(x_i)}\int K(u)[f'(x_i+hu)-f'(x_i)]\,du + I_i^*\int K(u)[m'(x_i+hu)-m'(x_i)]\,du - (1-I_i)\,\ell(x_i)y_i - (1-I_i^*)\,m'(x_i) + (1-I_i^*)\,a(x_i;h,b)$
The second moment $E[t_{1N}(v_i)^2]$ vanishes as $N\to\infty$: By the Lipschitz conditions of Assumption 7 the second moment of $[y_iI_i/f(x_i)]\int K(u)[f'(x_i+hu)-f'(x_i)]\,du$ is bounded by $(h/b)^2(\int|u|K(u)\,du)^2E[y^2\omega_{f'}(x)^2]=O[(h/b)^2]=o(1)$. The second moment of $I_i^*\int K(u)[m'(x_i+hu)-m'(x_i)]\,du$ is bounded by $h^2(\int|u|K(u)\,du)^2E[\omega_{m'}(x)^2]=O(h^2)=o(1)$. The second moments of $(1-I_i)\ell(x_i)y_i$ and $(1-I_i^*)m'(x_i)$ vanish by Assumption 4 since $b\to 0$ and $b^*\to 0$. Finally, the second moment of $(1-I_i^*)a(x_i;h,b)$ vanishes if the second moment of $a(x;h,b)$ exists, since $b^*\to 0$. Now consider the $\ell$-th component $a_\ell(x;h,b)$ of $a(x;h,b)$, and define the "marginal kernel" $K_{(\ell)}=\int K(u)\,du_\ell$ and the "conditional kernel" $K_\ell=K/K_{(\ell)}$. For a given x, integrating $a_\ell(x;h,b)$ by parts absorbs $h^{-1}$, and shows that $a_\ell(x;h,b)$ is the sum of two terms: first the expectation (w.r.t. K(u)) of $m_\ell'(x+hu)\{I[f(x+hu)>b]-I[f(x)>b^*]\}$, and second the expectation (w.r.t. $K_{(\ell)}$) of $K_\ell\,m(x+hu)$ over u values such that $f(x+hu)=b$. Because the variances of m' and y exist, the second moment (over x) of each of these terms exists, so that $E[a_\ell(x;h,b)^2]$ exists. Consequently, since the second moment of each component of $a(x;h,b)$ exists, we have that the second moment of $(1-I_i^*)a(x;h,b)$ vanishes, so that $E[t_{1N}(v_i)^2]=o(1)$.

This fact is sufficient to show asymptotic normality of $U_1$, because

(A.9)  $N^{-1/2}\sum_{i=1}^{N}\{r_{1N}(v_i)-E[r_{1N}(v_i)]\} = N^{-1/2}\sum_{i=1}^{N}\{r_1(v_i)-E[r_1(v_i)]\} + N^{-1/2}\sum_{i=1}^{N}\{t_{1N}(v_i)-E[t_{1N}(v_i)]\}$

and the last term converges in probability to zero, since its variance is bounded by $E[t_{1N}(v)^2]=o(1)$. Consequently, combining (A.9), (A.8) and (A.6), we conclude that

(A.10)  $\sqrt N[\tilde\delta_1-E(\tilde\delta_1)] = N^{-1/2}\sum_{i=1}^{N}\{r_1(v_i)-E[r_1(v)]\} + o_p(1)$

where $r_1(v)=\ell(x)y+m'(x)$.

The U-statistic representation $U_2$ of $\tilde\delta_2$ is analyzed in a similar fashion. In particular, $E[|p_{2N}(v_i,v_j)|^2]=o(N)$ if $b^2Nh^k\to\infty$, which is implied by condition (ii). By an analogous argument, $[U_2-E(U_2)]$ is shown to be $\sqrt N$ equivalent to the average of $[r_2(v_i)-E(r_2(v))]$, where $r_2(v)=-[\ell(x)y+\ell(x)m(x)]$. Combining this with (A.7) permits us to conclude that

(A.11)  $\sqrt N[\tilde\delta_2-E(\tilde\delta_2)] = N^{-1/2}\sum_{i=1}^{N}\{r_2(v_i)-E[r_2(v)]\} + o_p(1)$

where $r_2(v)=-[\ell(x)y+\ell(x)m(x)]$.


Combining (A.5), (A.10) and (A.11) then yields Step 2, as

$\sqrt N[\tilde\delta-E(\tilde\delta)] = N^{-1/2}\sum_{i=1}^{N}\{r(v_i)-E[r(v)]\} + o_p(1)$

with $r(v)\equiv r_0(v)+r_1(v)+r_2(v) = m'(x)+[y-m(x)]\,\ell(x)$, with $r(v)\equiv r(y,x)$ in the statement of Theorem 3.1.
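
To give a sense of scale for the sine model (5.2), the sketch below simulates $r(y,x)=m'(x)+[y-m(x)]\ell(x)$ with $\ell(x)=x$ and reports its mean and standard deviation. It assumes, as the statement of Step 2 suggests, that Σ is the variance of $r(y,x)$; the seed and number of draws are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n_draws = 200_000
x = rng.standard_normal((n_draws, 4))
e = rng.standard_normal(n_draws)
index = x.sum(axis=1)
m = np.sin(index)                                # m(x) in model (5.2)
y = m + e
m_prime = np.cos(index)[:, None] * np.ones(4)    # every component of m'(x) is cos(sum_j x_j)
score = x                                        # l(x) = x for standard normal regressors

r = m_prime + (y - m)[:, None] * score           # r(y, x) = m'(x) + [y - m(x)] l(x)

print("E[r]  :", r.mean(axis=0))                 # should be close to delta_0 = exp(-2)
print("sd(r) :", r.std(axis=0, ddof=1))
print("sd/10 :", r.std(axis=0, ddof=1) / np.sqrt(100))   # implied standard error at N = 100
```

Under these assumptions the implied standard error at N=100 comes out near .12 for each component, the same order as the standard deviations reported in Table 2.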


Step 3: Bias: Expand the bias of $\tilde\delta$ as

$E(\tilde\delta)-\delta = \tau_{1N}-\tau_{2N}-\tau_{3N}$

where

$\tau_{1N} = E\Bigl[N^{-1}\sum_{i=1}^{N}\ell(x_i)\,y_iI_i\Bigr]-\delta$

$\tau_{2N} = E\Bigl[N^{-1}\sum_{i=1}^{N}\dfrac{\hat f_h'(x_i)-f'(x_i)}{f(x_i)}\,y_iI_i\Bigr]$

$\tau_{3N} = E\Bigl[N^{-1}\sum_{i=1}^{N}[\hat f_h(x_i)-f(x_i)]\,\dfrac{\ell(x_i)\,y_iI_i}{f(x_i)}\Bigr]$

Let $A_N$, $B_N$ be defined as before. Then

$\tau_{1N} = \int_{A_N}\ell(x)m(x)f(x)\,dx - \int_{\Omega}\ell(x)m(x)f(x)\,dx = \int_{B_N}m(x)f'(x)\,dx = o(N^{-1/2})$

by Assumption 8.

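We only show that $\tau_{2N}=o(N^{-1/2})$, with the proof of $\tau_{3N}=o(N^{-1/2})$ quite similar. Let $\iota$ denote an index set $(\ell_1,\dots,\ell_k)$, where $\sum_j\ell_j=p$. For a k-vector $u=(u_1,\dots,u_k)$, define $u^{\iota}=u_1^{\ell_1}u_2^{\ell_2}\cdots u_k^{\ell_k}$, and let $f^{(p)}_{\iota}$ denote the p-th partial derivative of f with respect to the components indicated by $\iota$, namely $f^{(p)}_{\iota}=\partial^pf/(\partial u)^{\iota}$. By partial integration we have

$\tau_{2N} = \int_{A_N}m(x)\int K(u)\,[f'(x-hu)-f'(x)]\,du\,dx$

$\quad = \int_{A_N}m(x)\sum_{\iota}\int K(u)\,h^{p-1}f^{(p)}_{\iota}(\xi)\,u^{\iota}\,du\,dx$

where the summation is over all index sets $\iota$ with $\sum_j\ell_j=p$, and where ξ lies on the line segment between x and x-hu. Thus

$\tau_{2N} = h^{p-1}\int_{A_N}m(x)\sum_{\iota}f^{(p)}_{\iota}(x)\int K(u)\,u^{\iota}\,du\,dx \;+\; h^{p-1}\int_{A_N}m(x)\sum_{\iota}\int K(u)\,[f^{(p)}_{\iota}(\xi)-f^{(p)}_{\iota}(x)]\,u^{\iota}\,du\,dx = O(h^{p-1})$

by Assumption 9. Therefore, by condition (iii), we have $\tau_{2N}=O[N^{-1/2}(N^{1/2}h^{p-1})]=o(N^{-1/2})$, as required. Thus $E(\tilde\delta)-\delta=o(N^{-1/2})$.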
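
The mechanism behind Step 3 is that the vanishing low-order moments of a higher-order kernel remove the leading Taylor terms of the smoothing bias. The sketch below illustrates this for the density itself rather than its derivative, under assumptions chosen for simplicity: a scalar standard normal density and the fourth-order kernel $\tfrac{15}{32}(3-10u^2+7u^4)$, for which halving h should cut the bias by roughly $2^4$. The proof applies the same idea to f' using only p derivatives of f, which is why it obtains the rate $O(h^{p-1})$.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss


def k4(u):
    """Fourth-order kernel (15/32)(3 - 10 u^2 + 7 u^4) on [-1, 1]."""
    return (15.0 / 32.0) * (3.0 - 10.0 * u ** 2 + 7.0 * u ** 4)


def normal_density(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)


nodes, weights = leggauss(40)           # Gauss-Legendre rule on [-1, 1]


def smoothing_bias(x0, h):
    """Integral of K(u) [f(x0 - h u) - f(x0)] du for the standard normal density f."""
    return np.sum(weights * k4(nodes) * (normal_density(x0 - h * nodes) - normal_density(x0)))


bias_coarse = smoothing_bias(0.0, 0.2)
bias_fine = smoothing_bias(0.0, 0.1)
print(bias_coarse, bias_fine, bias_coarse / bias_fine)   # ratio near 2**4 for a kernel of order 4
```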

Step 4: Trimming: Steps 1-3 show that $\sqrt N(\bar\delta-\delta)$ has a limiting normal distribution with mean 0 and variance-covariance matrix Σ. We now show the same result for $\hat\delta$. Let Φ(Z) denote the cumulative distribution function of the normal distribution with mean 0 and variance Σ.

Let $c_N=c_f(N^{1-(\epsilon/2)}h^k)^{-1/2}$, where $c_f$ is an upper bound consistent with (A.1a). Define two new trimming bounds as $b_u=b+c_N$ and $b_\ell=b-c_N$, and the associated trimmed kernel estimators:

$\bar\delta_u = N^{-1}\sum_{i=1}^{N}\hat\ell_h(x_i)\,y_i\,I[f(x_i)>b_u]$

$\bar\delta_\ell = N^{-1}\sum_{i=1}^{N}\hat\ell_h(x_i)\,y_i\,I[f(x_i)>b_\ell]$

Since $b^{-1}c_N\to 0$ by condition (ii), $\bar\delta_u$ and $\bar\delta_\ell$ each obey the tenets of Steps 1-3, so $\sqrt N(\bar\delta_u-\delta)$ and $\sqrt N(\bar\delta_\ell-\delta)$ each have a limiting normal distribution with mean 0 and variance Σ (whose cumulative distribution function is denoted Φ). Moreover, by construction, we have that

$\mathrm{Prob}\{\sqrt N(\bar\delta_u-\delta)\le Z,\ |\hat f_h(x_i)-f(x_i)|\le c_N,\ i=1,\dots,N\}$

$\quad\le \mathrm{Prob}\{\sqrt N(\hat\delta-\delta)\le Z,\ |\hat f_h(x_i)-f(x_i)|\le c_N,\ i=1,\dots,N\}$

$\quad\le \mathrm{Prob}\{\sqrt N(\bar\delta_\ell-\delta)\le Z,\ |\hat f_h(x_i)-f(x_i)|\le c_N,\ i=1,\dots,N\}$

By (A.1a), as $N\to\infty$, $\mathrm{Prob}\{\sup|\hat f_h(x_i)-f(x_i)|>c_N\}\to 0$. Consequently, as $N\to\infty$, we have

$\Phi(Z)\le\lim\mathrm{Prob}\{\sqrt N(\hat\delta-\delta)\le Z\}\le\Phi(Z)$

so that $\lim\mathrm{Prob}\{\sqrt N(\hat\delta-\delta)\le Z\}=\Phi(Z)$.


Proof of Theorem 3.2: The estimator $\hat r_{hi}$ is constructed by direct estimation of the U-statistic component structure of $\tilde\delta$. First, set $\hat r_{0N}(v_i)=\hat\ell_h(x_i)\,y_i\,\hat I_i$. Next define $\hat p_{1N}(v_i,v_j)$ and $\hat p_{2N}(v_i,v_j)$ by replacing $f(x_i)$, $\ell(x_i)$, $I_i$, $f(x_j)$, $\ell(x_j)$, $I_j$ by $\hat f_h(x_i)$, $\hat\ell_h(x_i)$, $\hat I_i$, $\hat f_h(x_j)$, $\hat\ell_h(x_j)$, $\hat I_j$ respectively, in the formulae defining $p_{1N}(v_i,v_j)$ and $p_{2N}(v_i,v_j)$. Finally, define $\hat r_{1N}(v_i)=2N^{-1}\sum_j\hat p_{1N}(v_i,v_j)$ and $\hat r_{2N}(v_i)=2N^{-1}\sum_j\hat p_{2N}(v_i,v_j)$. By techniques similar to Collomb and Härdle (1986) and Silverman (1979), by construction we have that $\sup|[\hat r_{0N}(v_i)+\hat r_{1N}(v_i)+\hat r_{2N}(v_i)-r(v_i)]\,I_i|=o_p(1)$. The estimator $\hat r_{hi}$ of (3.5) is just $\hat r_{hi}=\hat r_{0N}(v_i)+\hat r_{1N}(v_i)+\hat r_{2N}(v_i)$.

By an argument similar to that of Step 4 above, it suffices to prove consistency of $\tilde\Sigma=N^{-1}\sum_i\hat r_{hi}\hat r_{hi}^T I_i-\hat\delta\hat\delta^T$. Set $r_i\equiv r(y_i,x_i)$, then

$\tilde\Sigma-[E(rr^T)-\delta\delta^T]$

$\quad = N^{-1}\sum_i(\hat r_{hi}-r_i)(\hat r_{hi}-r_i)^TI_i + N^{-1}\sum_i r_i(\hat r_{hi}-r_i)^TI_i + N^{-1}\sum_i(\hat r_{hi}-r_i)\,r_i^TI_i$

$\qquad - N^{-1}\sum_i r_ir_i^T(1-I_i) + N^{-1}\sum_i r_ir_i^T - E(rr^T) - \hat\delta\hat\delta^T + \delta\delta^T$

$\quad = o_p(1)$

since $\sup|\hat r_{hi}-r_i|\,I_i=o_p(1)$, $E(rr^T)$ exists, $\mathrm{Prob}\{f(x)\le b\}=o(1)$ and $\hat\delta$ is consistent for δ.


Proof of Theorem 3.3:

With $z_j=x_j^T\delta$, define $d_j=\hat z_j-z_j$. Since $f_1(z)>b_1>0$, $d_j=x_j^T(\hat\delta-\delta)=x_j^T\,O_p(N^{-1/2})$, so that $\sup_j\{d_j\}=O_p(N^{-1/2})$. Define the kernel regression function estimator using $z_i$ instead of $\hat z_i$ as

$\tilde g_{h'}(z) = \dfrac{N^{-1}h'^{-1}\sum_{j=1}^{N}K_1\bigl(\frac{z-z_j}{h'}\bigr)\,y_j}{\tilde f_{1h'}(z)}$

where $\tilde f_{1h'}$ is the density estimator:

$\tilde f_{1h'}(z) = N^{-1}h'^{-1}\sum_{j=1}^{N}K_1\Bigl(\dfrac{z-z_j}{h'}\Bigr)$

When $h'\sim N^{-1/5}$, it is a standard result (e.g. Schuster 1972) that $N^{2/5}[\tilde g_{h'}(z)-g(z)]$ has the asymptotic distribution given in Theorem 3.3. Consequently, the result follows if $\hat g_{h'}(z)-\tilde g_{h'}(z)=o_p(N^{-2/5})$.

First consider $\hat f_{1h'}-\tilde f_{1h'}$. By applying the triangle inequality to the Taylor expansion of $\hat f_{1h'}(z)$ we have

$\cdots\;+\;\sup_j\{d_j^2\}\,h'^{-2}\,\Bigl|N^{-1}h'^{-1}\sum_j K_1''\bigl[(z-\xi_j)/h'\bigr]\Bigr|$

where $\xi_j$ lies between $z_j$ and $\hat z_j$. Therefore $\hat f_{1h'}(z)-\tilde f_{1h'}(z)=o_p(N^{-2/5})$. By a similar argument, $\hat f_{1h'}(z)\hat g_{h'}(z)-\tilde f_{1h'}(z)\tilde g_{h'}(z)=o_p(N^{-2/5})$.

Consequently $\hat g_{h'}(z)-\tilde g_{h'}(z)=o_p(N^{-2/5})$. QED Theorem 3.3.

Table 1: ADE Estimation of the Linear Model (5.1)
N = 50   α = 1%

                               Low Bandwidth   High Bandwidth
                   h = 1.5       h = 1.0          h = 2.0

$\hat\delta_1$      .9713         .5059           1.0403
                   (.2775)       (.1556)          (.2634)

$\hat\delta_2$     2.0389        1.0587           2.0410
                   (.2476)       (.1542)          (.3408)

$\hat\delta_3$     3.0366        1.6328           3.1858
                   (.2940)       (.2163)          (.3779)

$\hat\delta_4$     4.0158        2.1354           4.1292
                   (.2661)       (.2568)          (.4385)

b                   .0015         .0077            .0005

Table 2: ADE Estimation of the Sine Model (5.2)
N = 100   α = 5%

                              Low Bandwidth   High Bandwidth
                   h = .9        h = .7          h = 1.5        Known Density

$\hat\delta_1$      .1134         .0428           .1921           .1329
                   (.0960)       (.0772)         (.1350)         (.1228)

$\hat\delta_2$      .1356         .0449           .1982           .1340
                   (.1093)       (.0640)         (.1283)         (.1192)

$\hat\delta_3$      .1154         .0529           .1837           .1330
                   (.1008)       (.0841)         (.1169)         (.1145)

$\hat\delta_4$      .1303         .0591           .2042           .1324
                   (.0972)       (.0957)         (.1098)         (.1251)

b                   .0117         .0321           .0017

Figure 1: Histogram of Observed Differences Between Average Squared Errors: ADE Regression and Multivariate Nonparametric Regression. (Horizontal axis ticks run from -.20 to .40.)

Figure 2: Basic Data and ADE Regression

Figure 3: ADE Regression and the True Regression Curve

Figure 4: ADE Regression and Regression Based on Known δ

Figure 5: Multivariate Nonparametric Regression and the True Regression Curve